This vignette shows how to read and interpret EchoGO outputs using the frozen demo results shipped with the package. It mirrors the EchoGO interpretation guide and manuscript, with emphasis on the consensus scores, diagnostics, and downstream summaries (RRvGO, networks).
All code here uses frozen demo outputs. No online calls or heavy computations are performed during vignette build.
EchoGO is a modular enrichment and interpretation pipeline built around two complementary enrichment methods:

- GOseq: bias-corrected GO enrichment (e.g., gene length bias). Uses Trinotate-derived GO terms.
- g:Profiler: cross-species enrichment based on orthologs in model species, run in two flavors: with a custom background (conservative) and genome-wide without a background (exploratory).
Each term is annotated with:

- consensus_score (strict) and
- consensus_score_all (exploratory).

We start from the main consensus table in the frozen demo snapshot:
cons_xlsx <- file.path(
out, "consensus",
"consensus_enrichment_results_with_and_without_bg.xlsx"
)
stopifnot(file.exists(cons_xlsx))
cons <- read_xlsx(cons_xlsx)
dplyr::glimpse(cons)
#> Rows: 1,326
#> Columns: 40
#> $ term_id <chr> "GO:0003674", "GO:0003824", "GO:0005488", "GO:0…
#> $ term_name.bg <chr> "molecular_function", "catalytic activity", "bi…
#> $ avg_fold_gprof_bg <dbl> 1.064921, 1.127930, 1.106415, 1.047374, 1.08194…
#> $ min_pval_gprof_bg <dbl> 1.520404e-11, 5.738836e-03, 9.942183e-05, 1.399…
#> $ in_drerio <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ in_hsapiens <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, F…
#> $ in_mmusculus <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ genes_gprofiler_bg <chr> "SF3B6, SF3B1, SF3A2, DDX42, SF3B3, SNIP1, SLU7…
#> $ term_name.nobg <chr> NA, "catalytic activity", "binding", NA, "intra…
#> $ avg_fold_gprof_nobg <dbl> NA, 1.598807, 1.259853, NA, 1.357215, 1.393627,…
#> $ min_pval_gprof_nobg <dbl> NA, 9.376941e-19, 5.562803e-41, NA, 5.474012e-6…
#> $ in_drerio_nobg <lgl> NA, TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE,…
#> $ in_hsapiens_nobg <lgl> NA, TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE,…
#> $ in_mmusculus_nobg <lgl> NA, TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE,…
#> $ genes_gprofiler_nobg <chr> NA, "DDX42, DDX1, DCPS, SENP7, PPIL1, DHX15, TG…
#> $ term_name <chr> "molecular_function", "catalytic activity", "bi…
#> $ ontology <chr> "MF", "MF", "MF", "CC", "CC", "CC", "BP", "BP",…
#> $ fold_enrichment_goseq <dbl> 0.1488307, 0.1369033, 0.1519897, 0.1489026, NA,…
#> $ min_pval_goseq <dbl> 0.000000e+00, 4.522545e-186, 0.000000e+00, 0.00…
#> $ gene_names <chr> "ISY1, BUD13, SF3B6, U2AF2, SF3B1, DDX46, SF3A2…
#> $ in_goseq <lgl> TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE, TRUE, T…
#> $ depth <dbl> 0, 1, 1, 0, 2, 3, 0, 1, 1, 2, 2, 3, 2, 1, 1, 2,…
#> $ all_genes <chr> "ISY1, BUD13, SF3B6, U2AF2, SF3B1, DDX46, SF3A2…
#> $ num_species_gprof_bg <dbl> 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
#> $ num_species_gprof_nobg <dbl> 0, 3, 3, 0, 3, 3, 0, 3, 3, 0, 3, 3, 3, 1, 2, 1,…
#> $ bg_prev <dbl> 0.6666667, 0.3333333, 0.3333333, 0.6666667, 0.3…
#> $ nobg_prev <dbl> 0.0000000, 1.0000000, 1.0000000, 0.0000000, 1.0…
#> $ is_robust_nobg <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TR…
#> $ sources_count <dbl> 2, 4, 4, 2, 4, 4, 2, 4, 5, 1, 4, 4, 4, 2, 3, 2,…
#> $ comp_p_strict <dbl> 1.0000000, 0.3735294, 0.6670864, 0.8090269, 0.6…
#> $ comp_p_all <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000, 1.0…
#> $ comp_fold_bg <dbl> 0.00000000, 0.09078757, 0.08956580, 0.00000000,…
#> $ comp_fold_nb <dbl> 0.00000000, 0.11482080, 0.09801907, 0.00000000,…
#> $ comp_fold_gs <dbl> 0.00000000, 0.01542579, 0.01701065, 0.00000000,…
#> $ comp_goseq <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ consensus_score <dbl> 0.9333333, 0.4585638, 0.5761318, 0.8569441, 0.5…
#> $ consensus_score_all <dbl> 0.8166667, 0.9492132, 0.9451399, 0.8166667, 0.9…
#> $ origin <chr> "GO terms - Consensus (with BG)", "GO terms - C…
#> $ significant_in_any <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ source_origin           <chr> "GOseq+g:Profiler_BG", "GOseq+g:Profiler_BG+g:P…

This section documents the scoring logic implemented in score_consensus_terms(). Two composite scores are provided:
- consensus_score → strict / true consensus (conservative).
- consensus_score_all → exploratory (includes a no-background contribution).

Both scores are weighted sums of normalized components that capture statistical significance, depth-weighted fold enrichment, cross-species prevalence, and GOseq support.
The function also computes:

- num_species_gprof_bg, num_species_gprof_nobg → species support,
- bg_prev, nobg_prev → normalized prevalence (0–1),
- origin, source_origin → human-readable origin tags,
- significant_in_any → whether any method supports the term.

Internally, two caps are used:

- P_CAP = 6 → used to cap −log₁₀ p-values.
- FOLD_CAP = 8 → used to cap fold enrichment values.

Species support from g:Profiler is summarized as:
- num_species_gprof_bg – number of species with a g:Profiler-with-background hit.
- num_species_gprof_nobg – number of species with a g:Profiler-without-background hit.

These are converted to prevalence fractions:

bg_prev   <- num_species_gprof_bg / n_bg_cols
nobg_prev <- num_species_gprof_nobg / n_nobg_cols

where n_bg_cols and n_nobg_cols are the number of in_* columns for with-/without-background species. This normalizes counts to 0–1 and avoids overstating prevalence when only a few species were queried.
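A quick worked check of this normalization (toy counts; the three-species setup mirrors the demo table, but the numbers are illustrative):

```r
# Toy prevalence computation: three in_* columns per flavor,
# as in the demo snapshot (drerio, hsapiens, mmusculus).
n_bg_cols   <- 3
n_nobg_cols <- 3

num_species_gprof_bg   <- c(2, 1, 0)  # illustrative counts
num_species_gprof_nobg <- c(0, 3, 1)

bg_prev   <- num_species_gprof_bg / n_bg_cols
nobg_prev <- num_species_gprof_nobg / n_nobg_cols

round(bg_prev, 4)    # 0.6667 0.3333 0.0000
round(nobg_prev, 4)  # 0.0000 1.0000 0.3333
```

A term supported by two of three background species thus gets bg_prev ≈ 0.667, matching the values visible in the glimpse above.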
P-values are converted to capped, normalized −log₁₀ scores:
cap_logp <- function(p, P_CAP) {
p <- dplyr::coalesce(p, 1)
p <- pmax(p, .Machine$double.xmin)
pmin(-log10(p), P_CAP) / P_CAP # scaled to [0, 1]
}

Two derived components are used:
# Strict p-value component: only g:Profiler with background
comp_p_strict <- cap_logp(min_pval_gprof_bg, P_CAP)
# Exploratory p-value component: best signal across all methods
comp_p_all <- pmax(
cap_logp(min_pval_gprof_bg, P_CAP),
cap_logp(min_pval_gprof_nobg, P_CAP),
cap_logp(min_pval_goseq, P_CAP)
)

Fold enrichment is capped and then modulated by term depth:
log2p1_cap <- function(x, cap) {
log2(1 + pmin(dplyr::coalesce(x, 0), cap))
}
depth_w <- function(d, maxd = 12L) {
d <- suppressWarnings(as.integer(d))
ifelse(is.na(d), 1, pmin(d / maxd, 1))
}

The fold components are:
comp_fold_bg <- depth_w(depth) * log2p1_cap(avg_fold_gprof_bg, FOLD_CAP)
comp_fold_nb <- depth_w(depth) * log2p1_cap(avg_fold_gprof_nobg, FOLD_CAP)
comp_fold_gs <- depth_w(depth) * log2p1_cap(fold_enrichment_goseq, FOLD_CAP)

Intuition: the fold components reward strong (capped) fold enrichment, but only for reasonably specific terms: depth_w() down-weights shallow terms, so root-level terms (depth 0) contribute nothing through fold enrichment.
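To make the depth modulation concrete, here is a small self-contained example (re-defining the helpers above, with dplyr::coalesce swapped for a base-R equivalent):

```r
# Depth weight: root terms (depth 0) get 0; depth >= maxd gets 1.
depth_w <- function(d, maxd = 12L) {
  d <- suppressWarnings(as.integer(d))
  ifelse(is.na(d), 1, pmin(d / maxd, 1))
}

# Capped log2 fold transform (base-R stand-in for dplyr::coalesce).
log2p1_cap <- function(x, cap) {
  x <- ifelse(is.na(x), 0, x)
  log2(1 + pmin(x, cap))
}

depth_w(c(0, 6, 12, 20))        # 0.0 0.5 1.0 1.0

# The same capped fold value contributes nothing for a root term
# and fully for a deep, specific term:
depth_w(0)  * log2p1_cap(8, 8)  # 0
depth_w(12) * log2p1_cap(8, 8)  # log2(9), about 3.17
```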
GOseq support is encoded as a simple indicator:
This is 1 if GOseq reports the term as significant, 0 otherwise.
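A minimal sketch of that indicator (the name comp_goseq matches the consensus table; treating NA as unsupported is an assumption here, and score_consensus_terms() may apply its own significance threshold):

```r
# GOseq support indicator: 1 if GOseq flags the term, 0 otherwise.
# NA (term not tested / not reported by GOseq) is treated as 0 (assumption).
in_goseq   <- c(TRUE, FALSE, NA, TRUE)  # illustrative logical column
comp_goseq <- as.numeric(!is.na(in_goseq) & in_goseq)
comp_goseq  # 1 0 0 1
```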
Two sets of weights define how the components are combined:
W_STRICT <- list(
w_goseq = 1.0,
w_bgprev = 0.8,
w_foldbg = 0.40,
w_foldgs = 0.40,
w_p = 0.40
)
W_ALL <- list(
w_goseq = 1.0,
w_bgprev = 0.7,
w_nbprev = 0.3,
w_foldbg = 0.35,
w_foldnb = 0.25,
w_foldgs = 0.35,
w_p = 0.35
)

consensus_score (strict) combines GOseq support, with-background prevalence, the with-background and GOseq fold components, and the strict p-value component:

consensus_score <- W_STRICT$w_goseq * comp_goseq +
  W_STRICT$w_bgprev * bg_prev +
  W_STRICT$w_foldbg * comp_fold_bg +
  W_STRICT$w_foldgs * comp_fold_gs +
  W_STRICT$w_p * comp_p_strict

consensus_score_all (exploratory) additionally includes the no-background prevalence and fold components:

consensus_score_all <- W_ALL$w_goseq * comp_goseq +
W_ALL$w_bgprev * bg_prev +
W_ALL$w_nbprev * nobg_prev +
W_ALL$w_foldbg * comp_fold_bg +
W_ALL$w_foldnb * comp_fold_nb +
W_ALL$w_foldgs * comp_fold_gs +
  W_ALL$w_p * comp_p_all

Rule-of-thumb:

- Use consensus_score (strict) for main figures and conservative interpretation.
- Use consensus_score_all to expand hypotheses and identify additional candidate processes.

EchoGO organizes each comparison’s outputs into a tidy structure. In the demo snapshot, you can explore the directory tree:
consensus/
plots_exploratory/
plots_strict/
diagnostics/
evaluation/
exploratory_no_bg/
goseq/
gprofiler/
no_background_genome_wide/
with_custom_background/
networks/
with_bg/
with_bg_and_nobg/
report/
rrvgo/
rrvgo_exploratory_all_significant/
OrgDb=...
rrvgo_true_consensus_with_bg/
OrgDb=...
Each folder has a specific interpretation purpose:
consensus/ – where to start. Includes:
Consensus Excel tables
Curated lollipop plots
plots_exploratory/ (exploratory mode)
plots_strict/ (strict mode)
diagnostics/ – internal QC outputs (GOseq diagnostic materials, length-bias checks).
diagnostics/evaluation/ – evaluation plots comparing enrichment methods:
EQI distributions
Fold-enrichment distributions
Rarefaction curves
exploratory_no_bg/ contains additional diagnostic
plots for no-background g:Profiler runs.
goseq/ – bias-aware GOseq enrichment outputs.
gprofiler/ – all g:Profiler enrichment results:
- with_custom_background/ → conservative, robust
- no_background_genome_wide/ → exploratory mode
rrvgo/ – semantic similarity reduction (BP/MF/CC):
Treemaps
Bubble plots
Heatmaps
Wordcloud
Cluster summary tables
Separate folders for True Consensus and
Exploratory modes.
networks/ – gene-overlap GO term networks:

- with_bg/ → strict consensus only
- with_bg_and_nobg/ → exploratory networks including no-background terms
Static PDF/SVG + interactive HTML versions.
report/ – the automatically generated HTML report when make_report = TRUE.
The consensus table bundles all key metadata you need: origin, species support, scores, and IDs for plotting and networks.
top_consensus <- cons %>%
dplyr::filter(significant_in_any) %>%
dplyr::arrange(dplyr::desc(consensus_score)) %>%
dplyr::select(
term_id, term_name, ontology,
num_species_gprof_bg, num_species_gprof_nobg,
consensus_score, consensus_score_all,
origin, source_origin
) %>%
head(25)
top_consensus
#> # A tibble: 25 × 9
#> term_id term_name ontology num_species_gprof_bg num_species_gprof_nobg
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 KEGG:00000 KEGG root te… KEGG 1 0
#> 2 GO:0009987 cellular pro… BP 2 3
#> 3 GO:0003674 molecular_fu… MF 2 0
#> 4 GO:0008150 biological_p… BP 2 0
#> 5 GO:0110165 cellular ana… CC 2 3
#> 6 GO:0005575 cellular_com… CC 2 0
#> 7 KEGG:03040 Spliceosome KEGG 1 3
#> 8 GO:0016020 membrane CC 1 0
#> 9 GO:0005737 cytoplasm CC 1 3
#> 10 GO:0005622 intracellula… CC 1 3
#> # ℹ 15 more rows
#> # ℹ 4 more variables: consensus_score <dbl>, consensus_score_all <dbl>,
#> #   origin <chr>, source_origin <chr>

Typical columns worth focusing on:
- term_id, term_name, ontology – GO identifier, label, and ontology (BP/MF/CC).
- origin – high-level origin categories, e.g., “GO terms – Consensus (with BG)”, “GO terms – GOseq only”, “GO terms – g:Profiler only (no BG)”.
- source_origin – fine-grained tag summarizing the combination of sources, e.g., "GOseq+g:Profiler_BG", "g:Profiler_BG+g:Profiler_noBG".
- num_species_gprof_bg, num_species_gprof_nobg – number of species supporting the term via g:Profiler (with and without background).
- consensus_score, consensus_score_all – strict vs. exploratory composite scores.
- significant_in_any – TRUE if any of GOseq, g:Profiler with BG, or g:Profiler no-BG reports the term as significant.

Figure 1 illustrates how these scores and annotations translate into a ranked list of biological processes in the demo dataset.
Figure 1. Exploratory EchoGO lollipop plot for BP terms in
the demo dataset.
Top exploratory BP GO/KEGG terms are ranked by the EchoGO exploratory
consensus score and plotted as a lollipop chart. Each point corresponds
to a GO or KEGG term, and the x-axis shows the ranking score used by
EchoGO to prioritize terms (higher = stronger, more consistent signal
across tools and species).
Reading tip: For high-confidence biological interpretation, prioritize terms that rank highly on consensus_score and have multi-species support (num_species_gprof_bg ≥ 2).

EchoGO includes diagnostics to show that consensus results are not arbitrary.
The diagnostics/evaluation folder includes, for example:
- EQI_distribution_by_ontology
- FoldEnrichment_distribution_by_ontology

In this vignette we illustrate the EQI distribution (Figure 2).
Figure 2. Enrichment Quality Index (EQI) by GO ontology in the demo dataset. Each point summarizes the agreement between GOseq and g:Profiler across species for a given term and ontology. Higher EQI values indicate terms where different tools and species give more consistent evidence of enrichment.
These plots help you check:
Are consensus terms strongly enriched relative to the GOseq background?
Is there reasonable agreement between GOseq and EchoGO’s g:Profiler-based evidence?
Are particular ontologies (BP/MF/CC) contributing more of the high-quality signal?
If you see terms with very low EQI but high exploratory scores, they are good candidates for “hypothesis-generating” rather than “headline” results.
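That triage can be sketched in a few lines (hypothetical eqi values and thresholds; real EQI values live in the diagnostics/evaluation outputs, and cutoffs should be chosen per dataset):

```r
# Flag terms with weak cross-method agreement (low EQI) but a high
# exploratory score: hypothesis-generating rather than headline results.
terms <- data.frame(
  term_id             = c("GO:A", "GO:B", "GO:C"),
  eqi                 = c(0.9, 0.1, 0.2),   # hypothetical agreement index
  consensus_score_all = c(0.8, 0.9, 0.3)
)

hypothesis_pool <- subset(terms, eqi < 0.3 & consensus_score_all > 0.7)
hypothesis_pool$term_id  # "GO:B"
```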
Rarefaction curves track how many unique GO terms are added as you include additional methods and species.
Typical files:
- rarefaction_curve_by_origin_true_consensus
- rarefaction_curve_by_origin_exploratory

Figure 3. Rarefaction curves for exploratory EchoGO terms by origin.
The exploratory rarefaction curve shows how the number of unique significant terms increases as methods and species are added. A steep initial rise followed by a slower increase suggests that EchoGO captures a core set of robust terms and then adds more speculative terms as additional evidence sources are included.
RRvGO reduces redundant GO terms into clusters of semantically similar terms, making large tables more digestible.
In the demo snapshot, you should find:
rr_true <- file.path(out, "rrvgo", "rrvgo_true_consensus_with_bg")
rr_all <- file.path(out, "rrvgo", "rrvgo_exploratory_all_significant")
data.frame(
mode = c("True_Consensus_with_BG", "Exploratory_all"),
exists = c(dir.exists(rr_true), dir.exists(rr_all))
)
#> mode exists
#> 1 True_Consensus_with_BG TRUE
#> 2         Exploratory_all   TRUE

If present, these folders typically contain:
if (dir.exists(rr_true)) {
head(list.files(rr_true, recursive = TRUE), 10)
}
#> [1] "OrgDb=org.Mm.eg.db/rrvgo_BP_bubbleplot.pdf"
#> [2] "OrgDb=org.Mm.eg.db/rrvgo_BP_clusters.csv"
#> [3] "OrgDb=org.Mm.eg.db/rrvgo_BP_heatmap.pdf"
#> [4] "OrgDb=org.Mm.eg.db/rrvgo_BP_scatterplot.pdf"
#> [5] "OrgDb=org.Mm.eg.db/rrvgo_BP_treemap.pdf"
#> [6] "OrgDb=org.Mm.eg.db/rrvgo_BP_wordcloud.pdf"
#> [7] "OrgDb=org.Mm.eg.db/rrvgo_CC_bubbleplot.pdf"
#> [8] "OrgDb=org.Mm.eg.db/rrvgo_CC_clusters.csv"
#> [9] "OrgDb=org.Mm.eg.db/rrvgo_CC_heatmap.pdf"
#> [10] "OrgDb=org.Mm.eg.db/rrvgo_CC_scatterplot.pdf"

Figure 4. RRvGO treemap of BP True Consensus terms in the demo dataset.
Each rectangle corresponds to a GO term; terms are grouped into semantically similar clusters, and the area of each tile reflects the relative importance or size of that term/cluster. This provides a compact “map” of the major biological themes emerging from consensus enrichment.
Reading tip:
Identify RRvGO clusters enriched in
high-consensus_score terms: these clusters often become
your story units (e.g. “DNA repair”, “oxidative stress”, “immune
activation”).
Within a cluster, representative terms serve as concise labels for reporting, while the full list of child terms can be explored in the RRvGO tables for mechanistic detail.
For more complex projects, RRvGO can also be used in comparison modules (e.g. shared vs. unique terms across treatments) to reduce redundancy there as well.
Network analysis shows how GO terms are connected via shared genes, helping you see modules and hubs.
net_dir <- file.path(out, "networks")
head(list.files(net_dir, recursive = TRUE), 10)
#> [1] "network_complexity_comparison.csv"
#> [2] "summary_with_bg.csv"
#> [3] "summary_with_bg_and_nobg.csv"
#> [4] "with_bg/network_BP_gene_count.graphml"
#> [5] "with_bg/network_BP_gene_count_filtered.html"
#> [6] "with_bg/network_BP_gene_count_filtered.pdf"
#> [7] "with_bg/network_BP_gene_count_filtered.svg"
#> [8] "with_bg/network_BP_gene_count_filtered_files/htmltools-fill-0.5.8.1/fill.css"
#> [9] "with_bg/network_BP_gene_count_filtered_files/htmlwidgets-1.6.4/htmlwidgets.js"
#> [10] "with_bg/network_BP_gene_count_filtered_files/vis-9.1.0/add_css.txt"

Typical structure:
- with_bg/ → only strict, background-corrected terms (True Consensus).
- with_bg_and_nobg/ → includes exploratory terms as well.

Figure 5. GO term overlap network for BP terms in the exploratory EchoGO set.
Nodes represent BP GO terms, and edges connect terms that share genes. Densely connected regions correspond to functional modules, while highly connected nodes act as hubs linking different processes.
Interpretation strategy: start with the strict networks (with_bg/) to identify core modules, then turn to the exploratory networks (with_bg_and_nobg/) for additional candidate connections.

The demo snapshot already includes rich plots under
consensus/. Here we show a very small, CRAN-friendly
example: a lollipop-style plot of the top 10 exploratory consensus terms
by ontology.
# Improved minimal plotting example (exploratory)
plot_df <- cons %>%
dplyr::filter(significant_in_any, !is.na(consensus_score_all)) %>%
dplyr::slice_max(consensus_score_all, n = 15, with_ties = FALSE) %>% # smaller = cleaner
dplyr::mutate(
term = stringr::str_trunc(
paste0(term_name, " (", term_id, ")"),
50
)
) %>%
dplyr::arrange(consensus_score_all) %>%
dplyr::mutate(term = factor(term, levels = term))
ggplot(plot_df, aes(
y = term,
x = consensus_score_all
)) +
geom_segment(aes(
yend = term,
xend = 0
), linewidth = 0.6, color = "grey70") +
geom_point(
aes(color = ontology),
size = 2.5
) +
scale_color_brewer(palette = "Dark2") +
labs(
x = "Consensus score (exploratory)",
y = NULL,
title = "Top 15 exploratory GO/KEGG terms by consensus score"
) +
theme_minimal(base_size = 11) +
theme(
axis.text.y = element_text(size = 8),
legend.position = "bottom",
plot.title = element_text(hjust = 0.5)
)

When you run EchoGO on your own data, a robust interpretation workflow is:
1. Start in consensus/ and keep terms with significant_in_any == TRUE.
2. Sort by consensus_score (strict) and scan the top terms.
3. Check the supporting diagnostics (diagnostics/ or evaluation/).
4. Use the semantic clustering (rrvgo/) to reduce redundancy.
5. Explore the term networks (networks/) for systems-level stories.
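The first two steps can be sketched against the consensus table (a toy stand-in for cons here; on real data, substitute the table loaded earlier):

```r
# Toy stand-in with the relevant consensus-table columns.
cons_demo <- data.frame(
  term_id            = c("GO:1", "GO:2", "GO:3"),
  consensus_score    = c(0.2, 0.9, 0.5),
  significant_in_any = c(TRUE, TRUE, FALSE)
)

# Step 1: keep supported terms; step 2: rank by the strict score.
shortlist <- cons_demo[cons_demo$significant_in_any, ]
shortlist <- shortlist[order(-shortlist$consensus_score), ]
shortlist$term_id  # "GO:2" "GO:1"
```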
For more detail, see vignette("EchoGO_workflow") for running the pipeline.

#> R version 4.4.3 (2025-02-28 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Berlin
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_4.0.1 tibble_3.3.0 stringr_1.6.0 tidyr_1.3.1
#> [5] dplyr_1.1.4 readxl_1.4.5 EchoGO_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.6 rappdirs_0.3.3 sass_0.4.10 generics_0.1.4
#> [5] xml2_1.5.1 stringi_1.8.7 hms_1.1.4 digest_0.6.37
#> [9] magrittr_2.0.4 evaluate_1.0.5 grid_4.4.3 RColorBrewer_1.1-3
#> [13] fastmap_1.2.0 cellranger_1.1.0 jsonlite_2.0.0 zip_2.3.3
#> [17] httr_1.4.7 rvest_1.0.5 purrr_1.2.0 scales_1.4.0
#> [21] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6 withr_3.0.2
#> [25] cachem_1.1.0 yaml_2.3.10 tools_4.4.3 tzdb_0.5.0
#> [29] ggvenn_0.1.19 vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
#> [33] pkgconfig_2.0.3 pillar_1.11.1 bslib_0.9.0 openxlsx_4.2.8.1
#> [37] gtable_0.3.6 glue_1.8.0 Rcpp_1.1.0 xfun_0.52
#> [41] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.50 farver_2.1.2
#> [45] htmltools_0.5.8.1 igraph_2.2.1 patchwork_1.3.2 labeling_0.4.3
#> [49] rmarkdown_2.30 readr_2.1.6 compiler_4.4.3 S7_0.2.1